Online clustering of parallel data streams

Authors

  • Jürgen Beringer
  • Eyke Hüllermeier
Abstract

In recent years, the management and processing of so-called data streams has become a topic of active research in several fields of computer science, such as distributed systems, database systems, and data mining. A data stream can roughly be thought of as a transient, continuously increasing sequence of time-stamped data. In this paper, we consider the problem of clustering parallel streams of real-valued data, that is to say, continuously evolving time series. In other words, we are interested in grouping data streams whose evolution over time is similar in a specific sense. In order to maintain an up-to-date clustering structure, it is necessary to analyze the incoming data in an online manner, tolerating no more than a constant time delay. For this purpose, we develop an efficient online version of the classical K-means clustering algorithm. Our method's efficiency is mainly due to a scalable online transformation of the original data, which allows for a fast computation of approximate distances between streams.
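
The idea in the abstract can be illustrated with a small sketch. The abstract does not say which transformation is used, so the code below assumes, purely for illustration, a sliding-window discrete Fourier transform truncated to a few low-frequency coefficients. The names `SlidingDFT`, `approx_distance`, `kmeans_step`, and `demo`, the window width, and the synthetic streams are all hypothetical and are not taken from the paper. Each new value updates the stored coefficients in time independent of the window width, and distances are computed on the short coefficient vectors instead of the raw windows.

```python
import cmath
import math
from collections import deque


class SlidingDFT:
    """First n_coeffs DFT coefficients of the last `width` values of one stream."""

    def __init__(self, width, n_coeffs):
        self.width = width
        self.window = deque([0.0] * width, maxlen=width)
        self.coeffs = [0j] * n_coeffs
        # Rotation applied to coefficient f when the window slides by one step.
        self.twiddle = [cmath.exp(2j * cmath.pi * f / width) for f in range(n_coeffs)]

    def push(self, value):
        oldest = self.window[0]
        self.window.append(value)  # maxlen discards the oldest value
        # Incremental update: remove the oldest value, add the new one, rotate.
        self.coeffs = [tw * (c - oldest + value)
                       for tw, c in zip(self.twiddle, self.coeffs)]

    def features(self):
        # Scaling by 1/sqrt(width) makes distances between feature vectors
        # a lower bound on Euclidean distances between the raw windows.
        scale = self.width ** -0.5
        return [c * scale for c in self.coeffs]


def approx_distance(a, b):
    """Euclidean distance between two (complex) feature vectors."""
    return sum(abs(x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def kmeans_step(points, centers):
    """One K-means iteration: assign points, then recompute the centers."""
    labels = [min(range(len(centers)), key=lambda k: approx_distance(p, centers[k]))
              for p in points]
    new_centers = []
    for k, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if members:
            new_centers.append([sum(col) / len(members) for col in zip(*members)])
        else:
            new_centers.append(old)  # keep an empty cluster's previous center
    return labels, new_centers


def demo(steps=40, width=16, n_coeffs=3):
    # Four synthetic streams: two follow a sine wave, two follow its mirror image.
    def sample(i, t):
        base = math.sin(2 * math.pi * t / 16)
        return (base, base + 0.1, -base, -base - 0.1)[i]

    trackers = [SlidingDFT(width, n_coeffs) for _ in range(4)]
    centers, labels = None, []
    for t in range(steps):
        for i, tracker in enumerate(trackers):
            tracker.push(sample(i, t))
        if t + 1 < width:
            continue  # wait until the first window is full
        points = [tracker.features() for tracker in trackers]
        if centers is None:
            centers = [points[0], points[2]]  # arbitrary seeding for the demo
        # Warm start: the centers from the previous time step are reused.
        labels, centers = kmeans_step(points, centers)
    return labels
```

Calling `demo()` groups the two sine-like streams into one cluster and the two mirrored streams into the other. Running a single K-means iteration per time step, warm-started from the previous centers, is a simplification chosen here to keep the per-step cost constant; it is not a claim about the authors' update scheme.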

Similar resources

Online-Data-Mining auf Datenströmen: Methoden zur Clusteranalyse und Klassifikation

• J. Beringer and E. Hüllermeier. Efficient instance based learning on data streams. Adaptive optimization of the number of clusters in fuzzy clustering. Fuzzy clustering of parallel data streams.

Fuzzy Clustering of Parallel Data Streams

The management and processing of so-called data streams has recently become a topic of active research in several fields of computer science, notably database systems and data mining. A data stream can roughly be thought of as a transient, continuously increasing sequence of time-stamped data. In this paper, we consider the problem of clustering parallel streams of real-valued data, that is to ...

Model-based clustering of high-dimensional data streams with online mixture of probabilistic PCA

Model-based clustering is a popular tool which is renowned for its probabilistic foundations and its flexibility. However, model-based clustering techniques usually perform poorly when dealing with high-dimensional data streams, which are nowadays a frequent data type. To overcome this limitation of model-based clustering, we propose an online inference algorithm for the mixture of probabilisti...

Probability Density Grid-based Online Clustering for Uncertain Data Streams

Most existing stream clustering algorithms combine an online component with an offline component. The disadvantage of such two-phase algorithms is that they cannot generate the final clusters online; accurate clustering results must be obtained through offline analysis. Furthermore, the clustering algorithms for uncertain data streams are unable to find clusters of arbitrary shapes accordi...

On clustering large number of data streams

Data streams and their applications appear in several fields such as physics, finance, medicine, environmental science, etc. As sensor technology improves, sensor data rates continue to increase. Consequently, analyzing data streams becomes ever more challenging. Fast online response is a must for applications that involve multiple data streams, especially when the number of data streams is lar...

Journal:
  • Data Knowl. Eng.

Volume 58

Publication date: 2006